1 Calculus
1.1 Limits
Definition 1.1 A sequence of real numbers is a (countably infinite) list \(a_1,a_2,...\) of real numbers. Often, we abbreviate this as \((a_n)_{n\in \mathbb N}\) or just \((a_n)\).
Definition 1.2 Let \((a_n)\) be a sequence of real numbers. We say that a limit of the sequence is \(L\), and write \[\lim_{n\to\infty}a_n = L\] if the sequence gets arbitrarily close to \(L\) for sufficiently large \(n\).
We can make this mathematically precise by digging into the grammar a bit. To say that it is arbitrarily close to \(L\) means that if we pick a target error \(\epsilon\), some small number, we want the \(a_n\) to be within \(\epsilon\) of \(L\), or in other words \(|L-a_n| < \epsilon\). This doesn’t have to happen for all \(a_n\), just those far enough out, meaning there’s a bound \(N\) such that for any index \(n>N\), we are \(\epsilon\)-close.
More concisely, we say that \(a_n\to L\) as \(n\to \infty\) if, for all \(\epsilon > 0\), there exists an \(N\) (depending on \(\epsilon\)) such that for all \(n > N\), we have \[|a_n - L| < \epsilon.\]
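For a concrete instance, take \(a_n = 1/n\) and \(L = 0\): given \(\epsilon > 0\), the bound \(N = \lceil 1/\epsilon\rceil\) works. A quick spot-check in Sage (the particular \(\epsilon\) and the sampled range of indices are arbitrary choices):

```python
# Spot-check the epsilon-N definition for a_n = 1/n -> 0.
epsilon = 1/1000                 # target error (a Sage rational)
N = ceil(1/epsilon)              # claimed witness: N = 1000
# every sampled index n > N is epsilon-close to the limit 0
all(abs(1/n - 0) < epsilon for n in range(N + 1, N + 500))
```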
A limit of a sequence does not need to exist in general. When it does exist, we say that the sequence converges or is convergent.
Example 1.1 The decimal expansion for a real number is a shorthand for a certain limit. Consider \[\pi = 3.141592653589...\] We can form a sequence \[3,3.1,3.14,3.141,3.1415,3.14159,...\] whose limit is \(\pi\).
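In Sage, we can generate these truncations directly (a small illustrative sketch; floor is used to chop rather than round):

```python
# Truncate pi after k decimal digits: floor(pi * 10^k) / 10^k
approximations = [floor(pi * 10^k) / 10^k for k in range(6)]
print(approximations)                       # [3, 31/10, 157/50, 3141/1000, ...]
print([a.n(digits=12) for a in approximations])
```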
The arithmetic of convergent sequences is quite straightforward. The following facts are likely familiar to you from your calculus courses:
Proposition 1.1
- A limit of a sequence, if it exists, is unique. Therefore, we can refer to it as the (unique) limit.
- If you add two convergent sequences term by term, the limits add accordingly.
- If you multiply two convergent sequences term by term, the limits multiply accordingly.
- If you scale a convergent sequence by a constant number, the limit scales accordingly.
- If you take the reciprocal of a sequence of non-zero terms, term by term, and the limit is also non-zero, then the reciprocals converge to the reciprocal of the limit.
Proof. See Exercise 1.1.
Two kinds of sequences, monotone and Cauchy, are special in real analysis. They are uniquely simple, in a certain sense.
Definition 1.3 A sequence is said to be monotone if it is always increasing or always decreasing from term-to-term, meaning that either \[a_{i+1} \geq a_i\ \ \ \ \textrm{or}\ \ \ \ a_{i+1}\leq a_i\] for all \(i\). In the first case, we may say it is monotone increasing, and in the second that it is monotone decreasing. If the inequalities are always strict (\(>\) or \(<\) only, respectively) we say that the sequence is strictly monotone, and either strictly increasing or strictly decreasing, respectively. Constant sequences are monotone but not strictly monotone.
Definition 1.4 A sequence is said to be Cauchy if pairs of terms \(a_m\) and \(a_n\) become arbitrarily close to each other when \(m\) and \(n\) are sufficiently large. Formally, we ask that for all \(\epsilon > 0\), there is a bound \(B\) such that for all \(m,n>B\) we have \[|a_n - a_m| < \epsilon.\] As in the definition of the limit, we think of \(\epsilon\) as a very small number, our target level of closeness.
1.2 Real Numbers
The main subtlety about limits of sequences is existence. For example, why does the sequence of decimal approximations of \(\pi\) (see Example 1.1) actually converge, let alone to \(\pi\)? What about other decimals – why are they sensible notation for real numbers?
In fact, the real numbers are defined/constructed as a natural “home” for limits of rational numbers, like decimal approximations. If a limit of rational numbers could plausibly exist, then it does so in the reals. You will learn more about this in a course on real analysis. For our purposes, the following facts summarize what we need to know about the interaction between limits and real numbers:
Theorem 1.1 Let \((a_n)\) be a sequence of real numbers. Then it converges to a real number in either of the following situations:
- \((a_n)\) is monotone and there is a real number \(B\) such that \(|a_n| < B\) for all \(n\).
- \((a_n)\) is Cauchy.
The first is especially useful in light of the following fact:
Lemma 1.1 If \((a_n)\) is a sequence of real numbers, then it has a monotone subsequence; in other words, there is an increasing list \(n_1<n_2<...\) such that the sequence \((b_k)\) given by \(b_k = a_{n_k}\) is monotone.
If the original sequence converges, then so do all of its subsequences, and they all converge to the same value. This leads to a natural strategy for analyzing sequences: first find a monotone subsequence and appeal to the monotone convergence theorem (the first part of Theorem 1.1), then find or characterize the limit of the subsequence, and finally check whether the original sequence converges to it.
For completeness, we mention a few more facts equivalent to Theorem 1.1 which we will occasionally use.
Definition 1.5 Let \(S\) be a subset of \(\mathbb R\). An upper bound (resp. lower bound) for \(S\) is a real number \(R\) such that \(R\geq s\) (resp. \(R\leq s\)) for all \(s\) in \(S\).
The least upper bound (resp. greatest lower bound) of \(S\) is an upper bound (resp. lower bound) which is no larger than (resp. no smaller than) any other upper bound (resp. lower bound) of \(S\).
The least upper bound, also called the supremum, is denoted \(\sup S\); the greatest lower bound, also called the infimum, is denoted \(\inf S\).
Theorem 1.2 The following are equivalent facts about the real numbers:
- Every bounded monotone sequence converges.
- Every Cauchy sequence converges.
- Every non-empty set \(S\) with an upper bound has a least upper bound.
- Every non-empty set \(S\) with a lower bound has a greatest lower bound.
- Every sequence of nested closed, bounded intervals has a non-empty intersection.
Some of these might sound or feel trivial, but note that they’re all false for the rational numbers. In fact, they even depend on the notion of size: there are other functions which look like the absolute value but which have very different behavior from the perspective of sequences and limits.
1.3 Continuity and IVT
Continuity is an important property of functions; it essentially says that they transform limits as nicely as one could hope (cf. Proposition 1.1). First, we need to define the notion of limits for functions.
Definition 1.6 Let \(f:\mathbb{R}\to \mathbb{R}\) be a function. Similar to sequence limits, we write \[ \lim_{x\to a}f(x) = b, \] and say that \(b\) is a limit of \(f\) at \(x=a\) (or as \(x\) goes to \(a\)) if \(f(x)\) becomes arbitrarily close to \(b\) on sufficiently small neighborhoods around \(a\). One can take \(a=\infty\) or \(a=-\infty\), in which case we mean that \(f(x)\) gets arbitrarily close to \(b\) as \(x\) grows very large (resp. very negative).
Another version of the limit notation is writing “\(f(x)\to b\) as \(x\to a\)” or even “\(f(x) \underset{x\to a}{\longrightarrow} b\)”.
The limit of a function captures the behavior of \(f\) near \(a\). Similarly, \(\lim_{x\to \infty}f(x)\) captures the behavior of \(f\) when \(x\) is large (“near infinity”), which is also described as the asymptotic behavior of \(f\).
The following real analysis fact relates function and sequence limits:
Proposition 1.2 Let \(f\) be a function. Then \(\lim_{x\to a}f(x) = b\) if and only if for any sequence \((x_n)\) approaching \(a\) but not containing \(a\), the sequence \((f(x_n))\) approaches \(b\).
Continuous functions are precisely those which transform limits of sequences “correctly”.
Definition 1.7 A real function \(f\) is said to be continuous at \(x=a\) if its limit at \(a\) agrees with its value there, i.e. if \[ f(a) = \lim_{x\to a}f(x). \] We say that \(f\) is continuous if it is continuous everywhere on its domain; more generally, we say that \(f\) is continuous on a subset \(A\) of its domain if \(f\) is continuous at all \(a\) in \(A\). No function is continuous on a set not contained in its domain.
Proposition 1.3 Let \(f\) be a function.
- A limit of a function is unique if it exists.
- The sum or product of two continuous functions is continuous.
- The reciprocal of a non-zero continuous function is continuous.
- If \(f\) is continuous at \(x=a\) and \(f(a)=b\), and \(g\) is continuous at \(x=b\), then the composition \(g\circ f\) is continuous at \(x=a\).
1.4 Order of Convergence
Some limits converge very “fast”, like \[\frac 1 {2^{2^n}} \to 0\ \ \textrm{as}\ \ n\to\infty,\] whereas others converge relatively slowly, like \[\sum_{k=1}^n \frac{(-1)^{k+1}}{k} \to \ln 2\ \ \textrm{as}\ \ n\to\infty.\]
Casually speaking, fast convergence means that, with respect to the definition of a limit, even a small \(\epsilon\) doesn’t require \(N\) to be too large, whereas slow means that a small \(\epsilon\) requires an enormous \(N\).
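A quick Sage experiment makes the contrast vivid (the sample sizes below are arbitrary choices):

```python
# Fast: 1/2^(2^n) is already astronomically small for modest n
print([(1 / 2^(2^n)).n(digits=3) for n in range(5)])
# Slow: partial sums of the alternating harmonic series creep toward log(2)
for n in [10, 100, 1000]:
    partial = sum((-1)^(k+1) / k for k in (1..n))
    print(n, (partial - log(2)).n(digits=3))
```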
We would like to make this precise. With numerical analysis in mind, we will often have methods that iteratively estimate solutions to certain problems, and we’d like to say something about how much accuracy is gained at each step. If convergence is slow, we might want to look for other methods, and if it’s fast, then we wouldn’t want to throw an excess of computing power (and money!) at it.
It turns out that much of calculus can be interpreted in terms of orders of convergence too; you could say that calculus is the study of the arithmetic of linearly-good approximations.
Definition 1.8 Let \(f(x)\) be a real function and \(a\) a point. We say that \(f(x)\) converges to \(0\) faster than linearly as \(x\) goes to \(a\) if we can write \[f(x) = c(x) (x-a)\] for some function \(c(x)\) such that \[\lim_{x\to a} c(x) = 0.\] Equivalently, \[\lim_{x\to a} \frac{f(x)}{x-a} = 0,\] in which case we can define \(c(x)\) to be \(f(x)/(x-a)\) when \(x\neq a\) and \(c(a) = 0\). We might also say that \(f\) converges to zero superlinearly as \(x\) goes to \(a\), or that \(f\) is much smaller than linear at \(x=a\).
The second formulation suggests the reason for saying that \(f\) is much smaller than linear. The denominator goes to zero, so the limit can only exist if the numerator also goes to zero (meaning \(f\) is small at \(x=a\)). But not only does the limit exist, it equals zero, rather than balancing at some finite number. That means \(f(x)\) must be even smaller than \(x-a\) for \(x\) near \(a\).
Superlinear convergence naturally suggests “superquadratic” convergence, where \((x-a)^2\) appears instead of \(x-a\). More generally, we can define convergence of arbitrary order.
Definition 1.9 Let \(f(x)\) be a real function and \(a\) a point. We say that \(f(x)\) converges to \(0\) with order \(n\) if \[f(x) = c(x) (x-a)^n\] for some function \(c(x)\) such that \[\lim_{x\to a} c(x) = 0.\] This is again equivalent to \[\lim_{x\to a} \frac{f(x)}{(x-a)^n} = 0.\]
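For instance, \(\sin(x) - x\) converges to zero with order \(2\) at \(x=0\), but not with order \(3\); a quick check in Sage (the function is an arbitrary illustration):

```python
x = var('x')
print(limit((sin(x) - x) / x^2, x=0))   # 0, so sin(x) - x is o(x^2) at 0
print(limit((sin(x) - x) / x^3, x=0))   # -1/6, so it is not o(x^3)
```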
We may also write this with “little-o” notation: \[f(x) = o\big((x-a)^n\big)\hspace{2em}(x\to a).\] In general, \(f(x) = o(g(x))\) as \(x\to a\) means that \(f(x)/g(x)\to 0\) as \(x\to a\).
This is a good way to compare functions: how can we make statements like “\(f(x)\) is really close to \(g(x)\) near \(x=a\)” more mathematically precise? This should be the same as saying that \(E_{f,g}(x) = f(x) - g(x)\) is quite small near \(x=a\). Order is one way of making this precise.
Definition 1.10 Let \(f\) and \(g\) be real functions. We say that they are equal up to order \(n\) (or equal modulo order \(n\)) at \(a\) if \(f(x) - g(x)\) converges to zero with order \(n\). Or, in other words, if \[E_{f, g}(x) = f(x) - g(x) = o\big((x-a)^n\big)\hspace{2em}(x\to a).\]
1.5 (Taylor) Polynomial Approximation
Now, we can apply the notion of “equal up to order \(n\)” to the problem of polynomial approximation. More precisely, given a function \(f\) and \(a\in\mathbb{R}\), we want to find, if possible, a polynomial \(P\) of degree at most \(n\) such that
- \(P(a) = f(a)\)
- \(P\) is equal to \(f\) up to order \(n\) at \(a\).
Definition 1.11 If such a polynomial exists, we call it the \(n\)th (Taylor) polynomial approximation to \(f(x)\) at \(x=a\).
Notice that if \(P(x)\) satisfies the two conditions above, without the degree stipulation, then so too does \(P(x) + (x-a)^{n+1}Q(x)\) for any polynomial \(Q(x)\) because
\[\begin{align*} \lim_{x\to a}\frac{f(x) - [P(x) + (x-a)^{n+1}Q(x)]}{(x-a)^n} &= \lim_{x\to a}\frac{f(x) - P(x)}{(x-a)^n} - \lim_{x\to a}\frac{(x-a)^{n+1}}{(x-a)^n}Q(x),\\ &= \lim_{x\to a}\frac{f(x) - P(x)}{(x-a)^n} - \lim_{x\to a}(x-a)Q(x),\\ &= 0 + 0 = 0. \end{align*}\]
Therefore, if we drop the degree requirement on \(P\), terms with degree \(\geq n+1\) can literally be anything. This is where the degree requirement comes from – it’ll guarantee uniqueness, so that it makes sense to refer to the \(n\)th Taylor polynomial approximation.
Theorem 1.3 The \(n\)-th order Taylor polynomial of a function \(f\) at \(a\) is unique (if it exists).
Proof. We only need to show that if \(P\) and \(Q\) are two \(n\)-th order Taylor polynomials of the same function \(f\) at the same point \(a\), they must be equal. So assume that \(P\) and \(Q\) are both equal to \(f\) up to order \(n\); then they are equal to each other up to order \(n\).
By assumption, we know two things:
\[ \lim_{x\to a} \frac{f(x) - P(x)}{(x-a)^n} = 0 \]
\[ \lim_{x\to a} \frac{f(x) - Q(x)}{(x-a)^n} = 0 \]
Subtracting the two and applying limit rules leads us to
\[ \lim_{x\to a} \frac{P(x)-Q(x)}{(x-a)^n} = 0 \]
The numerator is a difference of polynomials of degree at most \(n\), which we can write out in coefficients around \(x=a\), \[ P(x) - Q(x) = c_0 + c_1(x-a) + \dots + c_n(x-a)^{n}. \]
Now, we have \[ 0 = \lim_{x\to a} \frac{P(x) - Q(x)}{(x-a)^n} = \lim_{x\to a} \frac{c_0 + c_1(x-a) + \dots + c_n(x-a)^{n}}{(x-a)^n}. \]
The denominator goes to zero, so the limit can only exist if the numerator goes to zero. But the numerator is a polynomial, hence continuous, so that means the numerator has a root at \(x=a\), and therefore \(c_0 = 0\). Canceling a common factor doesn’t change the limit, and thus
\[ 0 = \lim_{x\to a} \frac{c_1 + \dots + c_n(x-a)^{n-1}}{(x-a)^{n-1}}. \]
Repeating the previous reasoning, we get \(c_1=0\), then \(c_2=0\) and so on. We won’t run out of \((x-a)\)’s in the denominator because the degree of the numerator is at most \(n\). Therefore, all the \(c_i\) are zero, so \(P(x) - Q(x) = 0\), as was to be shown (alternatively, one could prove by more analytic means that the first nonzero term \(c_k/(x-a)^{n-k}\) dominates the others).
1.6 Derivative
The derivative turns out to carry the same information as the \(1\)st Taylor polynomial approximation, the linear one. Any such linear approximation to \(f(x)\) at \(x=a\) must pass through the point \((a,f(a))\), so the only information needed to specify the line is its slope, because every such line takes the form \[L_m(x) = m(x-a) + f(a).\] That slope, \(m\), is the derivative. Of course, the line has to actually be a good linear approximation in the sense of the previous section. This is why derivatives do not always exist.
Linear approximations are desirable because lines are simple, and they have a lot of nice properties. Even though we expect higher order approximations to be more accurate (not always!) they could be harder to calculate. Plus, it turns out that higher order approximations can be calculated from a sequence of linear approximations.
Definition 1.12 The derivative of \(f\) at \(a\) is the slope of the best linear approximation (BLA) to \(f(x)\) at \(x=a\). This is the number \(m\) (a slope) such that the line \[L_m(x) = m(x-a) + f(a)\] is the first-order Taylor approximation to \(f(x)\) at \(x=a\). In other words, \(m\) must satisfy
\[\lim_{x\rightarrow a} \frac{f(x) - f(a) - m(x-a)}{x-a} = 0.\]
This number need not exist in general. When it does, we say that \(f\) is differentiable at \(a\). Typical notations for the derivative (the number \(m\)) include \(f'(a)\), \(\frac{df}{dx}(a)\), or \(\left.\frac{df}{dx}\right|_{x=a}\). The first two arise naturally from viewing the derivative of \(f\) as a function, written \(f'\) or \(\frac{df}{dx}\), which assigns to each \(a\) the corresponding slope.
Differentiation itself is then an operator, a function from functions to functions – in fact, it is even a linear operator.
Referring to the definition of order, it is easy to see that the derivative can be expressed as a limit:
\[f'(a) = \lim_{x\rightarrow a} \frac{f(x) - f(a)}{x-a}.\]
The quotient on the right hand side is naturally interpreted as the slope of the secant line through \((a,f(a))\) and \((x,f(x))\), the familiar limit definition from calculus. However, to see that this new definition really is equivalent to the usual one, we also need to show that when the usual derivative exists, it meets our criterion (homework).
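Sage will happily evaluate such secant-slope limits symbolically; for instance, with the illustrative choice \(f(x)=\sin(x)\) and a symbolic basepoint \(a\):

```python
x, a = var('x a')
f(x) = sin(x)
print(limit((f(x) - f(a)) / (x - a), x=a))   # cos(a), i.e. f'(a)
```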
There is a further way to write down the definition of the derivative. Consider the formula \[ f(x) - L_m(x) = c(x)(x-a). \]
We can rewrite this as \[ f(x) - f(a) = (m+c(x))(x-a). \] Therefore, \(f'(a)\) can also be defined as the value of \(m\) that makes this true for some \(c(x)\) satisfying \(\displaystyle \lim_{x\to a}c(x) = 0\). This form is convenient when we prove the composition rule later.
As we have defined it, the derivative of \(f\) can be viewed as the slope of the (asymptotically) best linear approximation to \(f\) at \(x=a\). The notion of order tells us that the error in the estimate, given by \[f(x) - [f(a) + f'(a)(x-a)],\] goes to zero faster than linearly as \(x\to a\) (in the precise sense of Definition 1.8). This is part of why the derivative is so helpful: it lets you estimate functions accurately.
Example 1.2 Suppose you’re told just a little information about a mystery function \(f(x)\): you know \(f(0) = 1\), and \(f'(x) = f(x)\). What is \(f(1)\)? At a first pass, we know \[f(x) \approx f(0) + f'(0)(x-0)\] for \(x\approx 0\). We know that \(f'(0) = f(0) = 1\), so we get \[f(x) \approx 1+x\] and so \(f(1) \approx 2\). Since it’s asymptotically best, we’d get a better result for values of \(x\) closer to zero. Take \(x=1/2\), and we can estimate \[f(0.5) \approx 1.5.\] But then we can start over with a fresh BLA using that information, \(f'(0.5) = f(0.5) \approx 1.5\) and hence \[f(x) \approx 1.5 + 1.5(x-0.5)\] for \(x\) near \(0.5\). If we then set \(x=1\), we arrive at \[f(1) \approx 1.5 + 1.5(0.5) = 2.25.\]
If one repeats this with smaller and smaller steps, we expect ever more accurate estimates for \(f(1)\). The only operations necessary for the estimate, because it’s the best linear approximation, are normal addition and multiplication.
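This repeated-BLA scheme is precisely Euler's method. A minimal sketch in Sage (the step counts below are illustrative):

```python
# Euler's method for the mystery function: f(0) = 1 and f'(x) = f(x).
def estimate_f1(steps):
    x, y = 0, 1
    h = 1 / steps                # step size (a Sage rational)
    for _ in range(steps):
        y = y + y * h            # BLA step: f(x + h) ≈ f(x) + f'(x)h, with f' = f
        x = x + h
    return y

# 1 and 2 steps reproduce the estimates 2 and 2.25 above; more steps
# approach the true value f(1) = e = 2.71828...
print([estimate_f1(s).n(digits=6) for s in [1, 2, 10, 100]])
```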
1.7 Derivative Rules
You are likely familiar with many derivative rules from your calculus classes. We will prove two of them here, and leave several others for your homework. A key idea is that the derivative lets us replace functions with lines, and (almost) all of the derivative rules correspond to operations on lines: what is the slope when you add two lines, for example?
The first rule we shall prove here is the product rule. In a real analysis course, one usually proves this by brute force. Here, we use a different argument that makes the above intuition rigorous while emphasizing the idea of approximation.
To arrive at the product rule, consider what happens when you multiply two lines: \[(a+bx)(c+dx) = ac + (ad +bc)x + bdx^2 \approx ac + (ad+bc)x\] The top degree term is small, so we dropped it, and the remaining “slope” of \(ad+bc\) has cross-terms involving the constant term and slope of each line. Slopes are derivatives, so we expect that \((fg)'\) will be of a similar form.
To prove this, we need a lemma to justify dropping small terms, essentially proving that order of convergence and multiplication are compatible. Then we prove the product rule by following the line calculation.
Lemma 1.2 Suppose \(f\) and \(\tilde f\) are equal up to order \(n\) at \(a\), and so are \(g\) and \(\tilde g\). Assume that all four functions admit limits at \(a\). Then \(fg\) and \(\tilde f\tilde g\) are equal up to order \(n\) at \(a\).
Proof. Write \(f(x) - \tilde f(x) = c_f(x)(x-a)^n\) and \(g(x) - \tilde g(x) = c_g(x)(x-a)^n\), where \(c_f(x)\) and \(c_g(x)\) both go to \(0\) as \(x\to a\). Then \[\begin{align*} (fg)(x) - (\tilde f\tilde g)(x) = f(x)g(x) - \tilde f(x)\tilde g(x) &= f(x)\big(g(x) - \tilde g(x)\big) + \tilde g(x)\big(f(x) - \tilde f(x)\big)\\ &= f(x)c_g(x)(x-a)^n + \tilde g(x)c_f(x)(x-a)^n\\ &= \big(f(x)c_g(x)+ \tilde g(x)c_f(x)\big)(x-a)^n \end{align*}\]
Now notice that \[ \lim_{x\to a } \big(f(x)c_g(x)+ \tilde g(x)c_f(x)\big) = 0, \] since \(f\) and \(\tilde g\) admit limits at \(a\) while \(c_f\) and \(c_g\) go to zero. Then, by definition, \(fg\) and \(\tilde f\tilde g\) are equal up to order \(n\) at \(a\).
Theorem 1.4 Suppose \(f\) and \(g\) are differentiable at \(a\). Then, \[(fg)'(a) = f'(a)g(a) + f(a)g'(a).\]
Proof. By definition, \(f(a) + f'(a)(x-a)\) and \(g(a) + g'(a)(x-a)\) are the unique linear approximations of \(f\) and \(g\) at \(a\), respectively. Now, by the preceding lemma, we see that \[\big(f(a) + f'(a)(x-a)\big)\big(g(a) + g'(a)(x-a)\big) = f(a)g(a) + \big(f(a)g'(a) + f'(a)g(a)\big)(x-a) + f'(a)g'(a)(x-a)^2\] is equal to \((fg)(x)\) up to order 1. For such polynomials, we saw already that the higher-degree term can be neglected (cf. the discussion of uniqueness of Taylor approximations) and that the linear part is unique, so the unique linear approximation of \(fg\) at \(a\) is given by \[ f(a)g(a) + \big(f(a)g'(a) + f'(a)g(a)\big)(x-a)\] Therefore, we have \[(fg)'(a) = f'(a)g(a) + f(a)g'(a).\]
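As a sanity check, Sage can verify the rule symbolically for generic (unspecified) formal functions \(f\) and \(g\):

```python
x = var('x')
f = function('f'); g = function('g')
print(diff(f(x) * g(x), x))   # g(x)*D[0](f)(x) + f(x)*D[0](g)(x)
```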
What if you add two lines? You’ll get \[(ax+b) + (cx +d) = (a+c)x + (b+d).\] Adding lines adds slopes, which suggests that adding functions adds derivatives (see Exercise 1.3).
What happens when you compose two lines? Well, \[(a+bx)\circ (c+dx) = a + b(c+dx) = a + bc + bdx.\] The slopes end up multiplying. So we should expect \((f\circ g)'\) to be a product of \(f'\) and \(g'\). The only small twist is making sure we multiply the correct two slopes. These are easy to guess: \((f\circ g)(a)\) requires finding \(g(a)\), suggesting \(g'(a)\), and then calculating \(f(g(a))\), which suggests \(f'(g(a))\). Reasoning with lines, therefore, leads us to think \[(f\circ g)'(a) = f'(g(a))g'(a),\] which is indeed the correct composition rule. As above, we can basically follow the linear reasoning.
Theorem 1.5 Let \(f\) and \(g\) be two real functions. Suppose \(g\) is differentiable at \(x=a\) and \(f\) is differentiable at \(x=g(a)\). Then \(f\circ g\) is differentiable at \(x=a,\) and \[(f\circ g)'(a)= f'(g(a))g'(a)\]
Proof. For variety, let’s use a different definition to prove this. The benefit of having different equivalent definitions is that you can switch freely between them!
By assumption, we have
\[\begin{align*} g(x) - g(a) &= \big(g'(a) + c_g(x)\big)(x-a)\\ f(y) - f(g(a)) & = \big(f'(g(a))+ c_f(y)\big)(y-g(a)) \end{align*}\] where \(c_g(x)\to 0\) as \(x\to a\) and \(c_f(y)\to 0\) as \(y\to g(a)\).
Then, \[\begin{align*} f(g(x)) - f(g(a)) &= \big(f'(g(a)) +c_f(g(x))\big)(g(x)-g(a))\\ &= \big(f'(g(a)) + c_f(g(x))\big)\big(g'(a) + c_g(x)\big)(x-a)\\ &= \bigg[f'(g(a))g'(a) + g'(a)c_f(g(x)) + f'(g(a)) c_g(x) + c_f(g(x))c_g(x) \bigg](x-a)\\ &= f'(g(a))g'(a)(x-a) + \bigg[g'(a)c_f(g(x)) + f'(g(a)) c_g(x) + c_f(g(x))c_g(x) \bigg](x-a)\\ \end{align*}\]
After dividing by \(x-a\), we still know that the three terms in the brackets go to \(0\) as \(x\to a\): differentiability implies continuity, so \(g(x)\to g(a)\) as \(x\to a\), hence \(c_f(g(x))\to c_f(g(a))=0\), and \(c_g(x)\to 0\) by definition. Therefore, \((f\circ g)'(a)= f'(g(a))g'(a)\).
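The same kind of symbolic check works for the composition rule; here \(f\) and \(g\) are again unspecified formal functions:

```python
x = var('x')
f = function('f'); g = function('g')
print(diff(f(g(x)), x))   # D[0](f)(g(x))*D[0](g)(x), i.e. f'(g(x))*g'(x)
```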
1.8 Derivatives in Sage
Sage makes it very easy to calculate symbolic derivatives of standard functions:
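For instance (a minimal example; the particular function \(\sin(x)e^x\) and the evaluation point are arbitrary illustrations):

```python
# 1. Define a symbolic variable and a (callable) symbolic function.
x = var('x')
f(x) = sin(x) * e^x

# 2. Two syntax options for symbolic derivatives.
fp = diff(f, x)
fp = f.derivative(x)     # equivalent

# 3. Two syntax options for evaluating the derivative.
print(fp(1))             # call the callable derivative directly
print(fp(x=1).n())       # or substitute by keyword, then get a numerical value
```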
1.9 The Mean Value Theorem
The Mean Value Theorem (MVT) is one of the most powerful results in real analysis. There are several variations, not all equivalent, including: the fundamental theorem of the derivative, the inequality form/racetrack principle, the secant form, and the Cauchy MVT.
Among other applications, it makes precise the following intuition: if you know where a function starts, and you know how fast it grows, then you know the function (the uniqueness theorem for simple ODEs!). This fact turns out to be deceptively subtle.
The following is sometimes attributed to Fermat.
Lemma 1.3 Let \(f\) be a real function which attains a local extremum at \(c\). If \(f\) is differentiable at \(c\), then \(f'(c) = 0\).
Proof. Let’s suppose the extremum is a local max; the argument for local mins is identical (or use \(-f\) and appeal to the derivative rules).
Recall that
\[f'(c) = \lim_{x\to c}\frac{f(x) - f(c)}{x-c}\]
Now, since \(f(c)\) is a local max, there is a neighborhood \((c - \delta, c + \delta)\) around \(c\) such that for all \(x\) in this neighborhood, \(f(x) \leq f(c)\).
Thus, for any \(x\) in this neighborhood with \(x<c\), we have \(\displaystyle \frac{f(x) - f(c)}{x-c}\geq 0\), and for any such \(x>c\) we have \(\displaystyle \frac{f(x) - f(c)}{x-c}\leq 0\). Therefore, when \(x\) approaches \(c\) from the left, \[\lim_{x\to c^-}\frac{f(x) - f(c)}{x-c}\geq 0,\] and when \(x\) approaches from the right, \[\lim_{x\to c^+}\frac{f(x) - f(c)}{x-c}\leq 0.\] For the limit to exist, the left and right limits must coincide. This leaves only one possibility, zero, so \[f'(c) = \lim_{x\to c}\frac{f(x) - f(c)}{x-c} = 0.\]
Theorem 1.6 (Rolle’s theorem) Let \(f\) be continuous on \([a, b]\) and differentiable on \((a, b)\). Suppose \(f(a) = f(b)\). Then, there is some \(c\in (a, b)\) where \(f'(c) = 0\).
Proof. Since \(f\) is continuous on \([a, b]\), it attains maximum and minimum values. Suppose first that the maximum or the minimum is attained at some point \(c\) in the open interval \((a, b)\). Since \(f\) is differentiable there, we can appeal to the preceding lemma to conclude that \(f'(c) = 0\).
Otherwise, neither the maximum nor the minimum is attained in the open interval, which means both are attained at the endpoints. However, \(f(a) = f(b)\), so this common value is both the minimum and the maximum of \(f\) on \([a, b]\), which implies that \(f\) is constant on \([a,b]\). In this case, any point \(c\in(a, b)\) satisfies \(f'(c) = 0\).
We don’t actually need to assume continuity at the endpoints; it is enough to assume that the limits of \(f\) at \(a\) and \(b\) both exist and are equal. As far as I know, the stronger assumption lingers for historical reasons. If the limits exist, we can always extend \(f\) by defining its values at the endpoints to be the corresponding limits. Notice that this operation doesn’t change the derivative of \(f\) on \((a, b)\): derivatives only depend on the behavior of a function in arbitrarily small neighborhoods, and for any \(x\in(a, b)\) you can find a small enough neighborhood of \(x\) that separates it from the endpoints.
With this, we can prove Cauchy’s MVT. The MVT you may know from calculus is the special case \(g(x)=x\), which arises naturally from the secant interpretation of the derivative. In essence, Cauchy’s MVT upgrades us from derivatives with respect to \(x\) to derivatives with respect to other functions. You’ve likely seen things like \[\frac{df}{d\ln},\ \ \ \ \frac{df}{de^x},\ \ \ \ \frac{df}{dg}\] in your other courses, and this theorem tells us that they behave as nicely as their notation suggests. In particular, l’Hôpital’s rule will follow quickly from this theorem.
Theorem 1.7 (Cauchy’s MVT) Let \(f\) and \(g\) be continuous on \([a, b]\) and differentiable on \((a, b)\). Then, there is some point \(c\in (a, b)\) where \[ f'(c)(g(b) - g(a)) = g'(c)(f(b) - f(a)). \]
First, some motivation. Recall from linear algebra that two vectors \((x_1, y_1)\) and \((x_2, y_2)\) are parallel if and only if \(x_1y_2 = x_2y_1\). Therefore, if we let \[\vec{h}(x) = \big(f(x), g(x)\big),\] what we need to prove is that there is some \(c\in(a, b)\) where the vector \[\vec{h}'(c) = \big(f'(c), g'(c)\big)\] is parallel to \[\vec{h}(b) - \vec{h}(a) = \big(f(b) - f(a), g(b) - g(a)\big).\] (Aside: the code block in the companion notebook implements a simple demo; a stand-in appears below.) Geometrically, we want to find a value of \(c\) such that the red and the green lines are parallel. If you draw a graph for Rolle’s theorem, you will see the scenarios are exactly the same, except that this time our diagram is tilted. Therefore, to reduce this case to Rolle’s theorem, we want to define a function \(t(x)\) that reflects the “vertical position” of \(\vec{h}(x)\) in the direction perpendicular to the red line, so that \(t(a) = t(b)\) and we can apply Rolle’s theorem to \(t(x)\). One way to do this is by taking the dot product of \(\vec{h}(x)\) with some fixed vector perpendicular to \(\vec{h}(b) - \vec{h}(a)\) (the red line). A convenient choice is to exchange the \(x\) and \(y\) coordinates and then flip the sign of the new \(x\) coordinate. This gives \(\big(g(a) - g(b), f(b) - f(a)\big)\).
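A minimal stand-in for such a demo in Sage (the choices of \(f\), \(g\), and the interval below are illustrative, not the notebook's exact code) solves for the point \(c\) where the tangent and chord directions are parallel:

```python
# Find c in (a, b) where the tangent direction (f'(c), g'(c)) is parallel
# to the chord h(b) - h(a); f, g, and the interval are illustrative choices.
f(x) = x^2; g(x) = x^3
a, b = 1, 2
eqn = diff(f, x)(x) * (g(b) - g(a)) == diff(g, x)(x) * (f(b) - f(a))
print(solve(eqn, x))   # [x == 0, x == 14/9]; the root 14/9 lies in (1, 2)
```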
Proof. Regardless of whether you like the above geometric motivation, you can verify that if we define \[ t(x) = f(x)(g(a) - g(b)) + g(x)(f(b) - f(a)), \] then we have \(t(a) = f(b)g(a) - f(a)g(b) = t(b)\), and that \(t(x)\) is differentiable on \((a, b)\) and continuous on \([a, b]\). Now, by Rolle’s theorem, there is some point \(c\in (a, b)\) where \[ 0 = t'(c) = f'(c)(g(a) - g(b)) + g'(c)(f(b) - f(a)). \] Therefore, at this point \(c\) we have \(f'(c)(g(b) - g(a)) = g'(c)(f(b) - f(a))\).
Similar to the case of Rolle’s theorem, this theorem remains true if we don’t assume continuity at the endpoints but only the existence of the limits there. Of course, in this case we need to replace \(f(a)\), \(f(b)\), \(g(a)\), \(g(b)\) by the corresponding limits. We may call this slightly stronger result the extended Cauchy MVT.
1.10 l’Hôpital’s Rule
Interestingly, l’Hôpital did not prove his rule. A mathematician in his employ (well, under his patronage) proved it, so his name was attached. It’s a handy way of calculating limits.
Note that the special case \(g(x) = x-a\) furnishes the definition of the derivative – what is remarkable is that one can “cancel” the \(x-a\) when “dividing” one definition of the derivative by another.
Theorem 1.8 Suppose that \(f\to 0\) and \(g\to 0\) as \(x\to a\), and that \[\lim_{x\to a} \frac{f'(x)}{g'(x)}\] exists. Then the limit \[\lim_{x\to a} \frac{f(x)}{g(x)}\] also exists, and \[ \lim_{x\to a} \frac{f(x)}{g(x)} =\lim_{x\to a} \frac{f'(x)}{g'(x)}. \]
Proof. Recall the useful real analysis fact (Proposition 1.2) that \[\lim_{x\to a}f(x) = b\] if and only if for any sequence \((x_i)\) approaching \(a\) but not containing \(a\), the sequence \((f(x_i))\) also approaches \(b\). So take such a sequence and let \(\displaystyle L = \lim_{x\to a} \frac{f'(x)}{g'(x)}\). Then, by this fact, we only need to show that \[ \lim \frac{f(x_i)}{g(x_i)} = L. \]
For each \(x_i\), apply the (extended) Cauchy MVT to \(f\) and \(g\) on the interval \((a, x_i)\) (or \((x_i, a)\) if \(x_i < a\)). We obtain a sequence \(c_i\) between \(a\) and \(x_i\) where \[f'(c_i)(g(x_i) - 0) = g'(c_i)(f(x_i) - 0),\] or equivalently \[ \frac{f'(c_i)}{g'(c_i)} = \frac{f(x_i)}{g(x_i)}. \]
Now, \(x_i\to a\) and each \(c_i\) lies between \(x_i\) and \(a\), so \((c_i)\) is also a sequence approaching \(a\) but not containing \(a\). Therefore, by the other direction of the useful fact, we have \[\lim \frac{f(x_i)}{g(x_i)} = \lim \frac{f'(c_i)}{g'(c_i)} = L.\]
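As a sanity check, Sage can evaluate both sides of the rule for a standard \(0/0\) form (the example function pair is an arbitrary choice):

```python
x = var('x')
f(x) = 1 - cos(x); g(x) = x^2
print(limit(f(x) / g(x), x=0))                       # 1/2
print(limit(diff(f, x)(x) / diff(g, x)(x), x=0))     # also 1/2, as the rule predicts
```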
1.11 Higher Order Derivatives and Taylor Polynomials
Definition 1.13 We define the second derivative of \(f\) at \(a\) to be the derivative of its derivative function at \(a\). Third and higher derivatives are similarly defined. We write \[f^{(n)}(a)\] to denote the \(n\)th order derivative of \(f\) at \(a\). If it exists, we say that \(f\) is \(n\)-times differentiable (on some domain, at some point). You can define it inductively as \[f^{(n)}(a) = (f^{(n-1)})'(a)\]
Some people would naturally suggest defining higher derivatives as coefficients of higher-degree approximations. The derivative is the slope of the first-order approximation, so one might want to define the second derivative to be the leading coefficient of the second-order approximation. In other words, if \[c_0 + c_1(x-a) + c_2(x-a)^2\] is an order-two approximation to \(f\) at \(x=a\), it seems reasonable to define the second derivative to be \(c_2\) – it’s not hard to see that \(c_0 = f(a)\) and \(c_1=f'(a)\), so this would give us \[f(a) + f'(a)(x-a) + f''(a)(x-a)^2\] as a second-order approximation, and so on.
For historical reasons, this is not our definition, and it would give you numbers that disagree with the standard ones; in fact, the leading coefficient is \(f''(a)/2\) under the standard definitions. The only advantage of the standard form is that you just iterate differentiation without having to think. Interestingly, the coefficient convention is the right definition for parts of number theory.
Theorem 1.9 Let \(n\geq1\) be an integer, and let \(f\) and \(g\) be two functions whose derivatives exist at \(a\) up to order \(n\). Then, \(f\) and \(g\) are equal up to order \(n\) at \(a\) if \(f(a) = g(a)\) and the derivatives up to order \(n\) all agree.
Proof. If \(n = 1\), then since \(f(a) = g(a)\) we have
\[ \lim_{x\to a}\frac{f(x) - g(x)}{x-a}= \lim_{x\to a}\bigg(\frac{f(x) - f(a)}{x-a} - \frac{g(x) - g(a)}{x-a}\bigg) = f'(a) - g'(a) = 0 \]
If \(n\geq 2\), then \(f'\) and \(g'\) are continuous at \(a\), so we can apply l’Hôpital’s Rule to obtain \[ \lim_{x\to a}\frac{f(x) - g(x)}{(x-a)^{n}} = \lim_{x\to a}\frac{f'(x) - g'(x)}{n(x-a)^{n-1}} \]
We need the continuity of \(f'\) and \(g'\) to show that the limit on the right-hand side exists, which has to be established before applying l’Hôpital’s Rule. Since \(n\) in the denominator is just a constant, we see that when \(n\geq 2\), \(f\) and \(g\) being equal up to order \(n\) reduces to \(f'\) and \(g'\) being equal up to order \(n-1\). After repeating this reduction \(n-1\) times, the proof is reduced to showing that \(f^{(n-1)}\) and \(g^{(n-1)}\) are equal up to order \(1\). This is the first case: we assumed that \(f^{(n)}(a)\) and \(g^{(n)}(a)\) are equal.
Corollary 1.1 Suppose, as in Theorem 1.9, that \(f\) is \(n\)-times differentiable at \(a\), and define
\[ P_{f, n, a}(x) = \sum_{k = 0}^{n}\frac{f^{(k)}(a)}{k!}(x - a)^k \]
Then \(P_{f, n, a}(x)\) is the unique \(n\)-th order Taylor approximation of \(f\) at \(a\).
Proof. Just check that \(P_{f, n, a}\) and \(f\) have the same value and the same derivatives at \(a\) up to order \(n\), then apply Theorem 1.9; uniqueness is Theorem 1.3.
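Sage’s built-in taylor computes exactly this polynomial; here is a quick cross-check against the coefficient formula, using the arbitrary example \(f(x)=e^x\) at \(a=0\):

```python
x = var('x')
print(taylor(e^x, x, 0, 3))   # 1/6*x^3 + 1/2*x^2 + x + 1
# the k-th coefficient is f^(k)(0)/k!, matching the formula for P
print([diff(e^x, x, k).subs(x=0) / factorial(k) for k in range(4)])
```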
Exercises
Exercise 1.1 Using the definition, verify Proposition 1.1.
Exercise 1.2 Try using a degree-two Taylor polynomial, the best quadratic approximation, to estimate values of the function from Example 1.2. What do you notice?
Exercise 1.3 Assume \(f\) and \(g\) are differentiable at \(x=a\). Prove the sum rule, \[(f+g)'(a) = f'(a) + g'(a),\] from the BLA/Taylor definition. Remember, adding lines adds slopes.
Exercise 1.4 Suppose that we know \(f(x) = 2-(x-3) + E(x)\) near \(x=3\), where \(2-(x-3)\) is the BLA, and we know that \(|E(x)|<1\) on the interval \((2,4)\).
- Let \(g(x)=x^3 - 2x^2 + 3\). Determine the BLA for \(g\circ f\) at \(x=3\).
- Can you determine a bound for the error term associated to the BLA of \(g\circ f\) on \((2, 4)\)? If so, provide it; if not, what information would you need to be able to do so?
- Suppose we have a function \(h(x)\) differentiable at \(x = 0\), and such that \(h(0) = 3\). Can you bound the error for the BLA of \(f\circ h\) near zero? If so, provide a bound; if not, what information would you need to be able to do so?
Exercise 1.5 Let \(f(x)\) be a polynomial of degree at least \(1\) and \(a\) a real number. Prove that \[g(x) = \frac{f(x) - f(a)}{x-a}\] is also a polynomial.
Exercise 1.6 Let \(f(x)\) be a bounded function, meaning that there is a number \(B\) such that \(|f(x)| < B\) for all \(x\). Prove that \[x f(x) \to 0\ \ \ \ \textrm{as}\ \ \ \ x\to 0.\]
Exercise 1.7 Define the sequence \[a_n = \sum_{k=0}^n 2^{-k}.\] Prove that \(a_n\to 2\) as \(n\to \infty\) from the definition. You can assume that \(\lim_{n\to\infty} 2^{-n} = 0\).
Exercise 1.8 Show that the limit definition of the derivative is equivalent to the definition in terms of BLAs (first order taylor approximations).
Exercise 1.9 Suppose \(f\) is differentiable at \(x=a\). Show that \(f\) is continuous at \(x=a\).
Exercise 1.10 In this problem, we will use the BLA definition of the derivative to prove that if \(f,g,1/g\) are differentiable at \(x=a\) and \(g(a)\neq 0\), then \(f/g\) is differentiable at \(x=a\) and its derivative is \[\left(\frac f g\right)'(a) = \frac{f'(a)g(a) - f(a)g'(a)}{g(a)^2}.\]
(a) Calculate the derivative of \(\dfrac 1 g\) by expressing the identity \[g \frac 1 g = 1\] in terms of BLAs.
(a’) Alternatively, calculate the derivative of \(f(x) = 1/x\). Then apply the composition rule to determine the derivative of \(\dfrac 1 g\).
(b) Combine the result of (a) or (a’) with the product rule.
Exercise 1.11 Let \(n>m\) be positive integers. Show that if \(f\) converges to zero at \(x=a\) with order \(n\), then it also converges to zero with order \(m\).
Exercise 1.12 Let \(P(t)\) be a function defined almost everywhere which is known to have the following properties: \[P(1) = 4\ \ \ \ P'(t) = tP(t)^2.\]
- Using a best quadratic approximation at \(t=1\), estimate \(P(0)\) and \(P(-1)\).
- Determine \(P''(t)\) (in terms of \(P\)). Assuming the function is defined at \(t=0\), do you think the critical point at \(t=0\) is a minimum, a maximum, or neither? Explain. Hint: squares are always positive.
- Looking at your answers to the previous parts, do you think your approximations are accurate? Do you think you have over- or under-estimated the true values? Briefly explain your reasoning (you do not need to provide a proof).